This assignment is about classification and clustering. We’re looking at a dataset put together from census data, and seeing how crime rate varies across multiple correlated variables.
The methods used are linear discriminant analysis (LDA) and k-means clustering.
Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here (https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html). (0-1 points)
library(MASS)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
There are 506 rows with 14 columns. The dataset seems to originally have been put together to analyze housing values across the suburbs. From the paper cited on the stat.ethz.ch site: “Harrison, D. and Rubinfeld, D.L. (1978) Hedonic housing prices and the demand for clean air”, we can find that a row represents a “SMSA census tract”, so, a census area in Boston containing some number of housing units.
The columns contain social statistics related to these census areas
(e.g. crim = crime rate, ptratio =
pupil-teacher ratio), data about the housing units in the area
(rm = avg # of rooms per unit, medv = median
housing unit value, age = prop. houses built before 1940),
and data about the location of the area (e.g. dis =
weighted mean of distances to employment centers, chas = 1
if by Charles River, 0 if not by Charles River).
Some of the columns are a little counter-intuitive or difficult to
interpret. E.g., the column age is the proportion of houses
built before 1940, and the column lstat is the proportion
of the population that is lower status. From the Harrison &
Rubinfeld paper, lower status means: “1/2 * (proportion of adults
without some high school education and proportion of male workers
classified as laborers)”.
Ok, before we move forward, we did see a small issue here: let’s
change chas from an integer to a boolean.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Boston_explore <- Boston %>% mutate(chas = chas == 1)
str(Boston_explore)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
summary(Boston_explore)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Mode :logical
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 FALSE:471
## Median : 0.25651 Median : 0.00 Median : 9.69 TRUE :35
## Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Looking at this summary, all columns are definitely not created
equal. I am mostly looking at the difference between the mean and the
median: if they differ by much, that can indicate that a variable is
not normally distributed. Some of the columns are close to normal
distributions, but e.g. zn has a median of 0 and a mean of
11.36. Other highly skewed columns are rad,
crim, tax, chas, and
black.
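As a quick numeric check of this, here is a sketch that measures the gap between mean and median in units of each column’s standard deviation (an ad-hoc skew proxy of my own, not a standard statistic):

```r
library(MASS)  # for the Boston dataset

# Gap between mean and median, scaled by each column's standard deviation;
# large absolute values suggest a skewed (non-normal) distribution.
gap <- sapply(Boston, function(x) (mean(x) - median(x)) / sd(x))
round(sort(abs(gap), decreasing = TRUE), 2)
```

The columns named above (zn, crim, rad, tax, chas) should rank near the top, while rm should sit near the bottom.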
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Boston_explore %>%
ggpairs(lower=list(combo=wrap("facethist",binwidth=0.5))) # Have to add the lower arg so ggpairs doesn't complain
fig. 3.1, Pairs plot of all variables (ggpairs)
Ok, lots to unpack here. Let’s start with a visual check of each
variable’s distribution (the diagonal). Almost none of the columns look
normally distributed, with perhaps the exception of rm, the
number of rooms.
There are lots of interesting correlations. Just looking at the
scatter plots, the three variables rm, medv, and
lstat seem to have strikingly strong relationships with
each other, which makes sense to me.
The rad variable, which is an “index of accessibility to
radial highways”, is clearly bimodal, or one could even say a split
distribution. A subset of areas has a much higher index than the
others, and in the scatter plots, this clearly visible line of the
higher-index population seems to consistently cover different ranges of
the other variable than the lower-index population. The effect is most
clearly noticeable in the crim, tax,
nox, dis and black scatter
plots.
dis and nox also have a strikingly
logarithmic-looking relationship.
In general, nearly every variable seems to be correlated with every
other variable, excepting the chas (area is by the Charles
river) column.
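To back up the claim about chas, a small sketch pulling its correlations out of the correlation matrix:

```r
library(MASS)  # for the Boston dataset

# Correlations of chas with every other variable; all of them are weak.
chas_cor <- cor(Boston)["chas", ]
round(sort(abs(chas_cor[names(chas_cor) != "chas"]), decreasing = TRUE), 2)
```

None of the absolute correlations should exceed roughly 0.2, which matches the visual impression from the pairs plot.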
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(Boston_explore), method="circle")
fig. 3.2, Correlation matrix with corrplot
Using corrplot, we lose some information, but get a better overview of where the correlations are strongest.
We see strong correlations (large circles) between:

- dis and (zn, indus, nox, and age)
- tax and rad; this is a very strong correlation, and they seem to capture much of the same variation
- tax and (crim, indus, and nox), and ditto for rad
- lstat and medv
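The strongest pairs can also be ranked numerically; here is a sketch that keeps each pair once and sorts by absolute correlation:

```r
library(MASS)  # for the Boston dataset

cm <- cor(Boston)
cm[!lower.tri(cm)] <- NA                     # keep each pair only once
idx <- order(abs(cm), decreasing = TRUE, na.last = NA)[1:5]
data.frame(var1 = rownames(cm)[row(cm)[idx]],
           var2 = colnames(cm)[col(cm)[idx]],
           cor  = round(cm[idx], 2))
```

The rad–tax pair should top the list with a correlation above 0.9, consistent with the corrplot.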
Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)
Let’s run the scale function on the dataset.
Boston_scaled <- as.data.frame(scale(Boston_explore))
summary(Boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We’ve now normalized the columns by subtracting the mean and dividing by the standard deviation such that, if they were normally distributed, they would now be standard normally distributed.
# Create quartile class
Boston_scaled$crim <- as.numeric(Boston_scaled$crim)
Boston_scaled$crime <- cut(Boston_scaled$crim, breaks = quantile(Boston_scaled$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# Drop crim
Boston_scaled <- dplyr::select(Boston_scaled, -crim)
We’ve split the crime rate column into a categorical variable defining which quartile of the crime rate distribution the census area falls in.
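As a quick check that the cut did what we wanted, a sketch recomputing the classes straight from the crim column (scaling is a monotone transformation, so it does not change quantile membership):

```r
library(MASS)  # for the Boston dataset

crim_scaled <- as.numeric(scale(Boston$crim))
crime_q <- cut(crim_scaled, breaks = quantile(crim_scaled),
               include.lowest = TRUE,
               labels = c("low", "med_low", "med_high", "high"))
table(crime_q)  # four roughly equal classes covering all 506 rows
```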
set.seed(179693716)
n <- nrow(Boston_scaled)
split <- sample(n, size = n * 0.8)
train <- Boston_scaled[split,]
test <- Boston_scaled[-split,]
Now we’ve split the dataset into two: 80% of rows are in the training set, 20% are in the test set.
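The resulting sizes can be sanity-checked in isolation; a sketch with a hypothetical seed, where floor() makes the truncation of 506 * 0.8 = 404.8 explicit:

```r
set.seed(42)  # hypothetical seed, any value gives the same sizes
n_demo <- 506
idx <- sample(n_demo, size = floor(n_demo * 0.8))
c(train = length(idx), test = n_demo - length(idx))  # → train 404, test 102
```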
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2475248 0.2351485 0.2623762
##
## Group means:
## zn indus chas nox rm age
## low 0.85485581 -0.8956178 -0.08120770 -0.8436157 0.4413548 -0.8603410
## med_low -0.09212737 -0.2763238 0.04263895 -0.5633101 -0.1877614 -0.3785908
## med_high -0.38704309 0.1791469 0.26643202 0.3710782 0.1054813 0.4474274
## high -0.48724019 1.0170298 -0.04947434 1.0820880 -0.4230980 0.8279971
## dis rad tax ptratio black lstat
## low 0.8290333 -0.6897369 -0.7642825 -0.43374243 0.3887547 -0.77819536
## med_low 0.3902633 -0.5443053 -0.4238660 -0.08385135 0.3219258 -0.11887074
## med_high -0.3933098 -0.4088466 -0.3025121 -0.20555126 0.1295279 0.06516648
## high -0.8564037 1.6390172 1.5146914 0.78181164 -0.8395683 0.90684062
## medv
## low 0.52171440
## med_low -0.04934231
## med_high 0.14052867
## high -0.73164690
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12991855 0.717286290 -1.11409479
## indus 0.01318562 -0.275547338 0.12479393
## chas -0.08510193 -0.049959856 0.17108503
## nox 0.50033989 -0.751262885 -1.27531579
## rm -0.10015731 -0.108297341 -0.19190232
## age 0.21057791 -0.358696692 -0.14782657
## dis -0.05526417 -0.347261139 0.38293342
## rad 3.16593191 1.032967549 -0.36643042
## tax -0.01168751 -0.096843332 1.10072573
## ptratio 0.12524469 0.004459516 -0.41009220
## black -0.12747580 -0.014860435 0.07573253
## lstat 0.22137355 -0.316638242 0.31488347
## medv 0.17976929 -0.444140572 -0.22238434
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9509 0.0340 0.0151
arrows <- function(x, scale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = scale * heads[,choices[1]],
y1 = scale * heads[,choices[2]], col=color, length = arrow_heads)
text(scale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
arrows(lda.fit, scale = 1.5, color = "#ee8855")
fig. 5.1 LDA biplot
We see that of the first two linear discriminants, LD1 nearly
perfectly separates the data into two clusters: those with a high crime
rate, and those with other values. rad has clearly the
highest coefficient in LD1, which can be seen both from the biplot and
the LDA summary.
LD2 seems to find another axis within the data that explains a
smaller effect. The largest coefficients in LD2 belong to
nox, medv, and zn.
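The “proportion of trace” line can also be recovered from the singular values stored in the fit. A sketch, refit on the full scaled data rather than the train split so the block is self-contained (the exact numbers will therefore differ slightly from the output above):

```r
library(MASS)  # for the Boston dataset and lda()

bs <- as.data.frame(scale(Boston))
bs$crime <- cut(bs$crim, breaks = quantile(bs$crim),
                include.lowest = TRUE,
                labels = c("low", "med_low", "med_high", "high"))
bs$crim <- NULL
fit <- lda(crime ~ ., data = bs)
# svd holds the singular values; their squares give the between/within
# group variance ratio captured by each discriminant
round(fit$svd^2 / sum(fit$svd^2), 4)
```

LD1 should again dominate, carrying roughly 95% of the trace.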
# Set aside the target variable and drop it from the test set
facit <- test$crime
test <- dplyr::select(test, -crime)
Let’s predict the crime rate quartiles in the test set and cross tabulate:
# Predict classes in the test data
lda.pred <- predict(lda.fit, newdata = test)
# Do a confusion matrix
tab <- table(correct = facit, predicted = lda.pred$class)
tab
## predicted
## correct low med_low med_high high
## low 15 9 0 0
## med_low 5 13 8 0
## med_high 1 10 19 1
## high 0 0 0 21
nrow(test)
## [1] 102
and here’s the same table as a confusion matrix:
image(tab)
fig. 6.1 Confusion matrix of the LDA fit
The confusion matrix shows that the model has found an axis that aligns very well with the crime rate quartile category. Most predictions are correct (68/102), the second most common case is being off by one class (33/102), and then off by two classes (1/102).
68/102 ~= 67% is not perfect, but it is a lot better than choosing at random, which should give us a correct prediction 25% of the time. It looks like the model can be used to make a decent predictor for whether an area has a high or non-high crime rate (the lower left of the matrix vs. the top right); for that predictor, we would have a correct classification rate of 101/102, nearly 100%!
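The rates quoted above can be computed directly from the cross-tabulation; a sketch re-entering the matrix by hand from the printed output:

```r
# Confusion matrix copied from the table above (correct x predicted)
conf_tab <- matrix(c(15,  9,  0,  0,
                      5, 13,  8,  0,
                      1, 10, 19,  1,
                      0,  0,  0, 21),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                                   predicted = c("low", "med_low", "med_high", "high")))
exact      <- sum(diag(conf_tab)) / sum(conf_tab)                        # 68/102
off_by_one <- sum(conf_tab[abs(row(conf_tab) - col(conf_tab)) == 1]) / sum(conf_tab)  # 33/102
round(c(exact = exact, off_by_one = off_by_one), 3)  # → 0.667, 0.324
```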
lda.radfit <- lda(crime ~ rad, data = train)
lda.radpred <- predict(lda.radfit, newdata = test)
radtab <- table(correct = facit, predicted = lda.radpred$class)
image(radtab)
fig. 6.2 Confusion matrix with a single variable LDA fit
Using only rad gives us a model that seems to have
exactly the same predictive power in the high vs not
high case, but loses information in the lower quartiles.
This fits with our earlier analysis of how LD1 was mostly
rad and was able to carve out most of the high
crime rate areas from the rest.
The analysis so far suggests there are at least two clear clusters in the data, so we could just choose k = 2, but let’s check with the total within cluster sum of squares what a good choice for k would be.
I will take five samples and average them, and plot the standard deviation of the twcss for each k as error bars, this should give us a more reliable plot than just taking one sample.
# determine the number of clusters
# calculate the total within sum of squares, take 5 samples to stabilize the variance
twcss1 <- sapply(1:10, function(k){set.seed(100); kmeans(Boston, k)$tot.withinss})
twcss2 <- sapply(1:10, function(k){set.seed(123); kmeans(Boston, k)$tot.withinss})
twcss3 <- sapply(1:10, function(k){set.seed(321); kmeans(Boston, k)$tot.withinss})
twcss4 <- sapply(1:10, function(k){set.seed(130); kmeans(Boston, k)$tot.withinss})
twcss5 <- sapply(1:10, function(k){set.seed(949); kmeans(Boston, k)$tot.withinss})
df <- as.data.frame(tibble(twcss1,twcss2,twcss3,twcss4,twcss5, k= seq(1,10)))
df$twcss <- rowMeans(select(df,c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df <- df %>% rowwise() %>% mutate(twcss_var = var(c(twcss1,twcss2,twcss3,twcss4,twcss5)))
df %>% ggplot(aes(x=k, y=twcss)) +
geom_line() +
geom_errorbar(aes(ymin=twcss-sqrt(twcss_var),ymax=twcss+sqrt(twcss_var),color="red"))+
theme(legend.position="none") +
scale_x_continuous(breaks=df$k) +
scale_y_continuous(breaks=seq(1000000,10000000,2000000))
fig. 7.1 k-means twcss plot
It does look like the plot agrees that k = 2 gives a very good fit. The k-means algorithm seems to always find the same clusters here: the error bars collapse to essentially zero height, indicating that the twcss measure is constant across the samples.
k=3 is also a potential choice, although less clear to me. After that, however, the variance increases greatly and each additional cluster reduces the twcss only minimally, which indicates that there isn’t a clear structure in the data to guide the clustering.
I will go with k=2, as that seems to match what we saw in the data earlier, and the clusters are very stable.
Boston_kmeans <- Boston %>% scale %>% as.data.frame
set.seed(101)
# k-means clustering
k2m <- kmeans(Boston_kmeans, centers = 2)
summary(k2m)
## Length Class Mode
## cluster 506 -none- numeric
## centers 28 -none- numeric
## totss 1 -none- numeric
## withinss 2 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 2 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
k2m$centers
## crim zn indus chas nox rm
## 1 -0.3894158 0.2621323 -0.6146857 0.002908943 -0.5823397 0.2446705
## 2 0.7238295 -0.4872402 1.1425514 -0.005407018 1.0824279 -0.4547830
## age dis rad tax ptratio black lstat
## 1 -0.4331555 0.4540421 -0.5828749 -0.6291043 -0.2943707 0.3282754 -0.4530491
## 2 0.8051309 -0.8439539 1.0834228 1.1693521 0.5471636 -0.6101842 0.8421083
## medv
## 1 0.3532917
## 2 -0.6566834
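To connect the centers back to the crime quartiles, here is a sketch cross-tabulating the cluster assignments against the quartile classes (recomputed so the block is self-contained; with the same seed it should reproduce the clustering above, but note that the 1/2 cluster labels are arbitrary):

```r
library(MASS)  # for the Boston dataset

bk <- as.data.frame(scale(Boston))
set.seed(101)
km2 <- kmeans(bk, centers = 2)
crime_q <- cut(bk$crim, breaks = quantile(bk$crim),
               include.lowest = TRUE,
               labels = c("low", "med_low", "med_high", "high"))
table(cluster = km2$cluster, crime = crime_q)
```

If the clustering really tracks crime rate, one cluster should absorb nearly all of the “high” quartile.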
It seems that even this k = 2 clustering has found the high crime rate cluster. Let’s confirm that with a visualization.
ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k2m$cluster), fill=factor(k2m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
fig. 7.2 k-means pairs, k = 2
Looking at the pairs plot, the model has picked two clusters that occupy distinct regions along most of the variables. It seems we have found a blue cluster with a lot of business activity (high tax rate, close to employment centers), with access to arterial highways and high-density building (lower number of rooms, zoned for smaller plots), and a red cluster of areas with less business activity, in relatively quiet regions with longer commutes and more working class or non-educated people, and a much higher proportion of black residents.
So it seems like the red regions are new developments, new suburbs
farther away from the city, and there may be some price discrimination
in the house prices (medv) connected with the high
proportion of black residents living there. You can see effects of
segregationist US housing policy in the data: e.g., the 1949 Housing Act
set up a framework to subsidize public housing for whites, with clauses
forbidding resale to black people, which meant that black people paid
more for housing (see e.g. “Abramovitz & Smith, The Persistence
of Residential Segregation by Race, 1940 to 2010: The Role of Federal
Housing Policy, Families in Society, Vol. 102, Issue 1” for
more).
Why not try with k=3 for good measure? Maybe we can find additional structure in the data.
set.seed(124293)
k3m <- kmeans(Boston_kmeans, centers = 3)
summary(k3m)
## Length Class Mode
## cluster 506 -none- numeric
## centers 42 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
k3m$centers
## crim zn indus chas nox rm age
## 1 -0.4076669 1.1526549 -0.9877755 -0.10115080 -0.9634859 0.7739125 -1.1241828
## 2 -0.3688324 -0.3935457 -0.1369208 0.07398993 -0.1662087 -0.1700456 0.1673019
## 3 0.8942488 -0.4872402 1.0913679 -0.01330932 1.1109351 -0.4609873 0.7828949
## dis rad tax ptratio black lstat
## 1 1.05650031 -0.5965522 -0.6837494 -0.62478941 0.3607235 -0.90904433
## 2 -0.07766431 -0.5799077 -0.5409630 -0.04596655 0.2680397 -0.05818052
## 3 -0.84882600 1.3656860 1.3895093 0.63256391 -0.7083974 0.90799414
## medv
## 1 0.84137443
## 2 -0.04811607
## 3 -0.69550394
ggpairs(Boston_kmeans[c("crim", "rad", "tax", "black", "age", "medv", "dis", "zn", "rm", "lstat")], aes(color=factor(k3m$cluster), fill=factor(k3m$cluster)), lower=list(combo=wrap("facethist",binwidth=0.5)))
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
fig. 7.3 k-means pairs, k = 3
Here we see that the blue cluster in this plot is roughly the same as the blue cluster from the k=2 clustering. The k=2 red cluster has here been split into red and green.
The differences between the red and green clusters seem to be:

- the red cluster has higher values of zn (more big plots)
- the red cluster has a smaller proportion of older buildings
- the red cluster consists of exclusively black neighbourhoods, while the green cluster has some spread
- the green cluster is between the red and blue clusters in the proportion of laborers and uneducated adults
Maybe the red cluster matches better with the black neighbourhoods built more recently, to which the 1949 Housing Act and the Federal Housing Administration regulations applied? I don’t know for sure; more analysis would be required.